
Guarantee missing stream promise delivery #12207


Merged
1 commit merged into grpc:master on Jul 17, 2025

Conversation

werkt
Contributor

@werkt werkt commented Jul 10, 2025

In observed cases, whether due to RST_STREAM or another failure from netty or the server, listeners can fail to be notified when a connection yields a null stream for the selected streamId. This causes hangs in clients, despite deadlines, with no obvious resolution.

This is not simply a race between netty delivering a result interpreted as a failure and the setSuccess previously implemented; the netty layer does not report the stream as failed.

Fixes #12185

@kannanjgithub
Contributor

Thanks for the PR. Can you fix the failing unit tests?

@werkt werkt force-pushed the null-stream-promise branch 2 times, most recently from 470738a to be78878 on July 11, 2025 14:02
@werkt
Contributor Author

werkt commented Jul 11, 2025

@kannanjgithub tests pass locally with modifications (not sure if they're suitable given the logic switch), but the checks don't seem to be rerunning.

@kannanjgithub kannanjgithub added the kokoro:run label Jul 14, 2025
@grpc-kokoro grpc-kokoro removed the kokoro:run label Jul 14, 2025
@werkt
Contributor Author

werkt commented Jul 14, 2025

@kannanjgithub the only failure in the Linux artifacts job for Kokoro seems to be a content issue with Apache's hosting, hit via curl (I retried the command locally and it passed).

@ejona86
Member

ejona86 commented Jul 15, 2025

It seems clear we need a unit test that triggers this, because we'd be very likely to break it again. I don't actually see what case is being missed in the current code. The comment seems to be talking about a case that already has a test covering it. So do things just need a slight additional tweak to trigger the breakage?

public void cancelBufferedStreamShouldChangeClientStreamStatus() throws Exception {
  // Force the stream to be buffered.
  receiveMaxConcurrentStreams(0);
  // Create a new stream with id 3.
  ChannelFuture createFuture = enqueue(
      newCreateStreamCommand(grpcHeaders, streamTransportState));
  assertEquals(STREAM_ID, streamTransportState.id());
  // Cancel the stream.
  cancelStream(Status.CANCELLED);
  assertTrue(createFuture.isSuccess());
  verify(streamListener).closed(eq(Status.CANCELLED), same(PROCESSED), any(Metadata.class));
}

This does seem like a good line of investigation. Note that I've only skimmed this as of yet, so I could be misunderstanding.

@werkt
Contributor Author

werkt commented Jul 15, 2025

@ejona86 I'm out of my depth here, so I apologize if this doesn't make any sense.

  • The handler that permits the stream to proceed to outboundWrites cannot locate its streamId on the connection.
  • The original implementation's comment attributes this to a specific case in which an RST_STREAM has occurred, but I don't see how to confirm that - I'm adding debugging to my current tracing to figure out whether, when we're stuck, there are any active streams, or whether the streamId ever existed.
  • The handler then absolves itself of responsibility, saying the connection should have delivered a CANCELLED to the listener, without asserting that this has actually taken place.
  • If this is the expected behavior of netty (that it always delivers a message to a listener whose registration I can't trace), then this sounds like a bug in netty, right?

Based on your ask and the observation of this situation, you're looking for a test which exhibits this lack of notification on the listener (where netty misses this delivery)? If so, doesn't there need to be a higher-level listener registered than just at the handler level (one that we assert was not called before delivering the promise failure)?

@werkt
Contributor Author

werkt commented Jul 15, 2025

The state of the connection looks suspect. This is debugging output for each of my outstanding calls that indicates we're stuck:

(13:00:53) WARNING: Still Incomplete: Write 1, 0 failsafes: buildfarm/uploads/320c6987-17fd-4557-8825-dd206be1ba68/blobs/blake3/b5faea3085e4ecce461dbc1e100c6e5696390e4ce01503f693c3e8e80c5d3469/4350, running for 429s, position 0 (last query: -1), queries 0/0, state: Uploading (waiting for ready), But we've never been ready, And onReady was never called, Tracer: CREATED|HANDLER_STREAM_CREATING|OPTIONAL_LABEL: stream=237, numActiveStreams=27, {17, 29, 33, 45, 47, 59, 103, 111, 117, 135, 153, 155, 163, 171, 175, 179, 181, 183, 185, 187, 189, 191, 193, 195, 197, 199, 201}, streamMayHaveExisted=true

For all of the stuck streams, the id is greater than the highest active stream. These active streams don't seem to be doing anything, but their count matches the number of stuck calls I have. All of them 'may have existed', for whatever that means.

Looking at DefaultHttp2Connection.java in netty, I don't see any concern for concurrent modifications of these sequence ids, streamMap, or activeStreams. Maybe I'm missing some serialization that is preventing any possible threaded interaction with connection().local(), but I don't understand otherwise how this is not far more prevalent.

@ejona86
Member

ejona86 commented Jul 15, 2025

I think the only case where future.isSuccess() && http2Stream == null can occur is one that is guaranteed to have called transportReportStatus(stopDelivery=true) before the listener is run. So it shouldn't end up mattering whether we do promise.setSuccess() or promise.setFailure().

CANCEL is unlikely to be the right status code. How about we use INTERNAL when failing the promise, with a message making it clear that this should never happen? Maybe: "Sending headers succeeded but there was no http2Stream. The stream should already be killed and this status will be discarded"

That guarantees that in the case of a bug the RPC will still become closed, and it means we'll likely see a bug report to investigate such a case if it happens, to figure out what went wrong. And it won't be a hang; hangs are horrible to debug.
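
A minimal sketch of this suggestion, not the actual grpc-java code: the class, method, and parameter names below are invented for illustration, and only the Status and Netty calls are real APIs.

import io.grpc.Status;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelPromise;
import io.netty.handler.codec.http2.Http2Connection;
import io.netty.handler.codec.http2.Http2Stream;

// Sketch only: when the HEADERS write succeeds but the connection has no stream for
// the chosen streamId, fail the creation promise with INTERNAL instead of calling
// setSuccess(), so the RPC is closed rather than left hanging.
final class HeadersWriteOutcome {
  static void complete(ChannelFuture headersWrite, Http2Connection connection,
      int streamId, ChannelPromise createPromise) {
    if (!headersWrite.isSuccess()) {
      createPromise.setFailure(headersWrite.cause()); // the write itself failed
      return;
    }
    Http2Stream http2Stream = connection.stream(streamId);
    if (http2Stream == null) {
      // The old code assumed a RST_STREAM had already reported CANCELLED and called
      // setSuccess(); failing the promise guarantees the RPC is closed even if that
      // assumption is wrong, and the status is discarded when it was right.
      createPromise.setFailure(Status.INTERNAL
          .withDescription("Sending headers succeeded but there was no http2Stream. "
              + "The stream should already be killed and this status will be discarded")
          .asRuntimeException());
      return;
    }
    createPromise.setSuccess(); // normal path: the stream exists
  }
}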

But that also means I don't think this fixes the issue you are hitting.


In the comment, "a stream buffered in the encoder" should be referring to Netty's StreamBufferingEncoder, which requires 1) the server to have set a MAX_CONCURRENT_STREAMS limit and 2) the client to exceed it. Could that be happening for you?
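
As a point of reference, a minimal sketch (assuming a grpc-java server; not from this PR) of one way condition 1) can arise: NettyServerBuilder.maxConcurrentCallsPerConnection() makes the server advertise MAX_CONCURRENT_STREAMS, and once a client has that many RPCs open on one connection, further HEADERS wait in the client's StreamBufferingEncoder. The port and limit below are arbitrary.

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

// Illustrative only: a server that advertises MAX_CONCURRENT_STREAMS = 100 per connection.
public final class LimitedServer {
  public static void main(String[] args) throws Exception {
    Server server = NettyServerBuilder.forPort(50051)   // arbitrary port
        .maxConcurrentCallsPerConnection(100)           // advertised stream limit
        .build();
    server.start();
    server.awaitTermination();
  }
}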

I'll note that transportReportStatus() is the normal way a listener is closed; the promise handling for writeHeaders() is a special case for when the RPC fails before Netty created the stream.

I think the RST_STREAM mentioned in that comment refers to this code (note that it calls transportReportStatus() before the RST_STREAM), as it is the only trigger of writeRstStream() in our code:

if (reason != null) {
  stream.transportReportStatus(reason, true, new Metadata());
}
if (!cmd.stream().isNonExistent()) {
  encoder().writeRstStream(ctx, stream.id(), Http2Error.CANCEL.code(), promise);
} else {
  promise.setSuccess();
}

The reason != null was added for the case where the server completed the call. I see three places that create CancelClientStreamCommand, all in NettyClientStream. Maybe one of them has a bug where reason == null, causing transportReportStatus() to not be called:

  • transportHeadersReceived(): this uses null for the reason, but is fine because transportTrailersReceived() calls transportReportStatus()
  • http2ProcessingFailed. Already calls transportReportStatus(), so reason doesn't matter, although it is known non-null because transportReportStatus() has a checkNotNull().
  • cancel(). Is called by AbstractClientStream.cancel(), which has a checkNotNull()

So no bug there.


Looking more at StreamBufferingEncoder, and the places it removes streams from pendingStreams:

  • writeRstStream(). That's the case we are handling already. There is a risk here that there is a RST_STREAM being sent from the client that we aren't aware of, but since the stream hasn't been created in Netty yet there's not a lot of code that could do that
  • close() will fail the promise
  • tryCreatePendingStreams() will fail the promise if an exception occurs.
  • cancelGoAwayStreams() will fail the promise

So no bug there.

@werkt
Contributor Author

werkt commented Jul 15, 2025

@ejona86 "hangs are horrible to debug" - this.
I have this hanging in front of me at will; see my previous comment about the current state of things. I can poke at nearly any layer of what's happening here. Is there anything you want me to do to get more information out of this, so we aren't just band-aiding whatever is really going wrong?

@ejona86
Member

ejona86 commented Jul 15, 2025

Looking at DefaultHttp2Connection.java in netty, I don't see any concern for concurrent modifications of these sequence ids, streamMap, or activeStreams. Maybe I'm missing some serialization that is preventing any possible threaded interaction with connection().local(), but I don't understand otherwise how this is not far more prevalent.

I still need to stare at the connection state more, but the model for Netty is that all the state changes happen on a single thread: the event loop. There are multiple threads ("event loop group"), but when a connection is created it chooses one and uses it for its lifetime. A single event loop can handle multiple connections.

So there's no need for synchronization. However, callback ordering can get pretty nasty; we have definitely seen problems in the past with callbacks being executed in a bad order, or with some call being made directly without popping up the stack first (e.g., reentrancy).
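
A tiny sketch of that model using Netty's public Channel/EventLoop API; the helper class and method here are invented for illustration and are not part of grpc-java or Netty.

import io.netty.channel.Channel;

// Every channel is bound to one event loop for its lifetime, so per-connection state is
// mutated only from that loop instead of being guarded by locks.
final class EventLoopHop {
  static void runOnChannelLoop(Channel channel, Runnable stateMutation) {
    if (channel.eventLoop().inEventLoop()) {
      stateMutation.run();                        // already on this channel's event loop
    } else {
      channel.eventLoop().execute(stateMutation); // hop onto it; tasks run one at a time
    }
  }
}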

@ejona86
Member

ejona86 commented Jul 15, 2025

I have this hanging in front of me at will

at will. Nice. Does this PR fix the hang?

@werkt
Contributor Author

werkt commented Jul 15, 2025

I have this hanging in front of me at will

at will. Nice. Does this PR fix the hang?

Yes, it does in all cases. I haven't proven that the expected null stream case ever occurs where we DO receive the CANCELLED (as implied/intended by the unchanged implementation), but this overreach in terms of exception delivery guarantees that we fall back to what becomes a retry (and eventually resolves).

@ejona86
Member

ejona86 commented Jul 15, 2025

I think I found it. I'll have to think a bit for how to fix it. This PR can definitely go in because it would prevent the RPCs from hanging forever, but I do think we should be using INTERNAL, because this was indeed a bug.

I realized I missed mentioning the case that the stream was created but then was killed before the callback was called. I considered it, but any further writes would have their callbacks called in appropriate order. And the RPC isn't known to the remote yet, so there shouldn't be any RST_STREAM.

But I missed GOAWAY and purposefully closing the stream, coupled with the write being buffered by Netty core I/O, not by StreamBufferingEncoder. If the stream is created but the receiving peer (the server) is a bit slower, then the HEADERS will be enqueued waiting to be sent.

There are two cases I see:

  • forcefulClose(). This calls transportReportStatus(), but only if the grpc transport state has been set on the Netty stream. But that happens in the callback in createStreamTraced() only after the headers have been sent. Note that this is only relevant when calling channel.shutdownNow()
  • goingAway(). The same as the last one; it would be unable to call transportReportStatus() if the headers haven't been written yet

Both of those "races" can only happen to the last few RPCs on a connection, as the HEADERS have to be buffered locally still.

In both of these cases, we have a proper error, but just aren't communicating it to the stream. We want to avoid making a status in createStreamTraced() because that method doesn't actually know the reason the RPC died.

@werkt
Contributor Author

werkt commented Jul 15, 2025

I think I found it. I'll have to think a bit for how to fix it. This PR can definitely go in because it would prevent the RPCs from hanging forever, but I do think we should be using INTERNAL, because this was indeed a bug.

Sounds good, offer still stands to add any debug logging/trace to my at will reproducer to solve this the right way.

Just to confirm: I change CANCELLED to INTERNAL and you're a green stamp? Do you want the exception creation outside of createStreamTraced() for this?

@ejona86
Member

ejona86 commented Jul 15, 2025

@werkt, this should fix what you are seeing. It isn't a full fix, because it doesn't work if the stream was buffered initially by StreamBufferingEncoder and then later by the I/O subsystem when the GOAWAY was received. But I doubt you are seeing that case.

diff --git a/netty/src/main/java/io/grpc/netty/NettyClientHandler.java b/netty/src/main/java/io/grpc/netty/NettyClientHandler.java
index a5fa0f800..b455180bb 100644
--- a/netty/src/main/java/io/grpc/netty/NettyClientHandler.java
+++ b/netty/src/main/java/io/grpc/netty/NettyClientHandler.java
@@ -768,6 +768,10 @@ class NettyClientHandler extends AbstractNettyHandler {
             }
           }
         });
+    Http2Stream http2Stream = connection().stream(streamId);
+    if (http2Stream != null) {
+      http2Stream.setProperty(streamKey, stream);
+    }
   }
 
   /**

That is being run immediately after the encoder().writeHeaders(), before the listener is likely executed.

@ejona86
Member

ejona86 commented Jul 15, 2025

I change CANCELLED to INTERNAL and you're a green stamp?

Yes. That is strictly better while also not ignoring the fact that there is a bug in the code. But you'll probably start noticing the failure. My small patch should fix that for you (but not in all cases). We can get both of those fixes in our next release (originally scheduled for last week, but now looking like this week).

Do you want the exception creation outside of createStreamTraced() for this?

Really, the normal "proper" thing is to call stream.transportReportStatus(status, RpcProgress.MISCARRIED, true, new Metadata()), similar to the error path already in that listener. At that point we could still do promise.setSuccess() and avoid the exception creation altogether.

But I'm quite willing to pay the cost of the exception creation if it avoids an RPC hang. This is apparently pretty rare for most people, too, because I think this bug has always existed in grpc-java. (Aren't you lucky, to find such a bug!)

I expect your client and server aren't in the same datacenter; you're going over a slower link of some sort, or just pushing a lot of bytes, or the server is under some CPU pressure. With that in mind, I can understand how many people wouldn't see this, because it requires a race of sending an RPC when receiving a GOAWAY while the TCP connection is fully buffered.

@ejona86 ejona86 added the TODO:backport label Jul 15, 2025
@werkt
Contributor Author

werkt commented Jul 15, 2025

@ejona86 Yahtzee. Your change alone also fixes the hang, preliminarily. I'm going to run my overnight test to confirm for sure (I set one up for the previous fix). Let me know what you want to do here; I'm happy to close this assuming it turns out the same.

Yes, this is definitely against a server with high latency (22ms) and limited connectivity (1Gbps from the client), under extreme concurrency (~100 streams/channel) with nearly full link saturation and CPU utilization. The server is well equipped for this, and traffic actually goes through nginx to reach Java again on the receive side.

ejona86 added a commit to ejona86/grpc-java that referenced this pull request Jul 15, 2025
In grpc#12185, RPCs were randomly hanging. In grpc#12207 this was tracked down
to the headers promise completing successfully, but the netty stream
was null. This was because the headers write hadn't completed but
stream.close() had been called by goingAway().
@ejona86
Member

ejona86 commented Jul 15, 2025

I'd hope to still get this PR in. I'm doing my other half in #12222; this will be the back-up for when things still go wrong (as we know they still can).

In observed cases, whether RST_STREAM or another failure from netty or
the server, listeners can fail to be notified when a connection yields a
null stream for the selected streamId. This causes hangs in clients,
despite deadlines, with no obvious resolution.

Tests which relied upon this promise succeeding must now change.
@werkt werkt force-pushed the null-stream-promise branch from be78878 to 2a766c7 on July 16, 2025 00:01
@werkt
Contributor Author

werkt commented Jul 16, 2025

I'd hope to still get this PR in. I'm doing my other half in #12222; this will be the back-up for when things still go wrong (as we know they still can).

So be it. I changed CANCELLED to INTERNAL. Tests will still fail in this condition, so leaving them alone.

@ejona86 ejona86 added the kokoro:run label Jul 16, 2025
@grpc-kokoro grpc-kokoro removed the kokoro:run label Jul 16, 2025
@werkt
Contributor Author

werkt commented Jul 16, 2025

@ejona86 I'm unable to get the kokoro tests to run locally; does the test failure make sense to you?

@ejona86
Member

ejona86 commented Jul 17, 2025

They were just flakes. I restarted it earlier and it seems to be passing now. The Java 17 failure is also a flake.

@ejona86 ejona86 merged commit a37d3eb into grpc:master Jul 17, 2025
15 of 16 checks passed
ejona86 added a commit that referenced this pull request Jul 17, 2025
In #12185, RPCs were randomly hanging. In #12207 this was tracked down
to the headers promise completing successfully, but the netty stream
was null. This was because the headers write hadn't completed but
stream.close() had been called by goingAway().
ejona86 added a commit to ejona86/grpc-java that referenced this pull request Jul 17, 2025
In grpc#12185, RPCs were randomly hanging. In grpc#12207 this was tracked down
to the headers promise completing successfully, but the netty stream
was null. This was because the headers write hadn't completed but
stream.close() had been called by goingAway().
ejona86 added a commit that referenced this pull request Jul 18, 2025
In #12185, RPCs were randomly hanging. In #12207 this was tracked down
to the headers promise completing successfully, but the netty stream
was null. This was because the headers write hadn't completed but
stream.close() had been called by goingAway().
Labels
TODO:backport PR needs to be backported. Removed after backport complete
Development

Successfully merging this pull request may close these issues.

blockingUnaryCall withDeadlineAfter RPC request hangs forever
4 participants